

Runbook: Certificate Expiry

Alert

Severity

Critical -- Expired certificates cause TLS failures across the platform, breaking ingress traffic, inter-service communication, and webhook connectivity.

Impact

Investigation Steps

  1. List all certificates and their status:
kubectl get certificates -A
  2. Check for certificates that are not ready or near expiry (the READY column selects the condition of type Ready rather than relying on its position in the list):
kubectl get certificates -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,EXPIRY:.status.notAfter,RENEWAL:.status.renewalTime'
  3. Describe the failing certificate for detailed condition messages:
kubectl describe certificate <certificate-name> -n <namespace>
  4. Check cert-manager controller logs for errors:
kubectl logs -n cert-manager deployment/cert-manager --tail=100
  5. Check the CertificateRequest resources associated with the failing certificate:
kubectl get certificaterequest -n <namespace>
kubectl describe certificaterequest <name> -n <namespace>
  6. Check the Order and Challenge resources (for ACME/Let's Encrypt issuers):
kubectl get orders -A
kubectl get challenges -A
  7. Verify the ClusterIssuer or Issuer is ready:
kubectl get clusterissuers
kubectl describe clusterissuer <issuer-name>
  8. Check cert-manager webhook health:
kubectl get pods -n cert-manager
kubectl logs -n cert-manager deployment/cert-manager-webhook --tail=50
  9. Check the cert-manager HelmRelease status:
flux get helmrelease cert-manager -n cert-manager
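
The expiry check in the steps above can be automated. The following is a minimal sketch, not part of the original runbook: it assumes GNU `date` (for `date -d`) and working `kubectl` access, and the function names and the 14-day default threshold are hypothetical.

```shell
#!/bin/sh
# Sketch: report certificates that expire within THRESHOLD_DAYS.
# Assumptions: GNU date, kubectl access; names and threshold are illustrative.

THRESHOLD_DAYS="${THRESHOLD_DAYS:-14}"

# days_left EXPIRY_EPOCH NOW_EPOCH
# Whole days until expiry, truncated toward zero (negative once a
# certificate has been expired for a full day or more).
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# expiring_certs
# Prints "namespace/name days" for every certificate at or under the threshold.
expiring_certs() {
  now=$(date -u +%s)
  kubectl get certificates -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.notAfter}{"\n"}{end}' |
  while read -r cert not_after; do
    if [ -z "$not_after" ]; then
      # No notAfter recorded yet: the certificate was probably never issued.
      echo "$cert no-expiry-recorded"
      continue
    fi
    expiry=$(date -u -d "$not_after" +%s) || continue
    left=$(days_left "$expiry" "$now")
    if [ "$left" -le "$THRESHOLD_DAYS" ]; then
      echo "$cert $left"
    fi
  done
}
```

Running `THRESHOLD_DAYS=30 expiring_certs` after sourcing the file would widen the window to 30 days.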

Resolution

Certificate stuck in not-ready state

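  1. Delete the failing CertificateRequest to trigger a new one:
kubectl delete certificaterequest <name> -n <namespace>
  2. If the certificate is still not ready, toggle the temporary-certificate annotation to make cert-manager reconcile the Certificate again. cert-manager documents this annotation as requesting a temporary self-signed certificate while issuance is pending, so treat this as a workaround rather than a guaranteed re-issue:
kubectl annotate certificate <name> -n <namespace> cert-manager.io/issue-temporary-certificate="true" --overwrite
  3. Then remove the annotation so that only the issuer-signed certificate remains in use:
kubectl annotate certificate <name> -n <namespace> cert-manager.io/issue-temporary-certificate-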
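
As an alternative to the annotation workaround, cert-manager's `cmctl` CLI has a `renew` subcommand that marks a Certificate for re-issuance. The sketch below assumes `cmctl` is installed, which this runbook does not state; the function name, the `DRY_RUN` switch, and the 300-second timeout are hypothetical.

```shell
#!/bin/sh
# Sketch: request re-issuance with cmctl, then wait for the Ready condition.
# Assumption: cmctl is installed and uses the same kubeconfig as kubectl.
# Set DRY_RUN=1 to print the commands instead of running them.

# reissue_cert NAME NAMESPACE
reissue_cert() {
  name=$1
  ns=$2
  ${DRY_RUN:+echo} cmctl renew "$name" -n "$ns" || return 1
  ${DRY_RUN:+echo} kubectl wait --for=condition=Ready "certificate/$name" \
    -n "$ns" --timeout=300s
}
```

For example, `DRY_RUN=1 reissue_cert <name> <namespace>` shows what would be executed without touching the cluster.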

ClusterIssuer not ready (self-signed CA)

  1. Check the CA secret exists:
kubectl get secret -n cert-manager | grep ca
  2. If the root CA secret is missing, recreate it. Check the ClusterIssuer spec for the expected secret name:
kubectl describe clusterissuer sre-ca-issuer
  3. Re-apply the cert-manager manifests via Flux:
flux reconcile helmrelease cert-manager -n cert-manager
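
The first two checks can be combined so the secret name is read from the issuer instead of matched with `grep`. This sketch assumes `sre-ca-issuer` is a CA-type issuer (so the name lives at `spec.ca.secretName`) and that the secret is in the `cert-manager` namespace; both are assumptions, and the function name and `CA_NAMESPACE` variable are hypothetical.

```shell
#!/bin/sh
# Sketch: look up the CA secret referenced by a ClusterIssuer and confirm it exists.
# Assumptions: CA-type issuer; secret stored in the cert-manager namespace.

# check_ca_secret [ISSUER]
check_ca_secret() {
  issuer=${1:-sre-ca-issuer}
  secret=$(kubectl get clusterissuer "$issuer" -o jsonpath='{.spec.ca.secretName}')
  if [ -z "$secret" ]; then
    echo "ClusterIssuer $issuer has no spec.ca.secretName (not a CA issuer?)" >&2
    return 1
  fi
  # Exits non-zero with kubectl's NotFound error if the secret is missing.
  kubectl get secret "$secret" -n "${CA_NAMESPACE:-cert-manager}"
}
```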

ACME challenge failure (Let's Encrypt)

  1. Check challenge status:
kubectl describe challenge <name> -n <namespace>
  2. Verify that DNS resolves correctly for the domain.
  3. Verify that the HTTP-01 solver can reach the challenge endpoint (check the Istio gateway and VirtualService).
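
The DNS and HTTP-01 checks can be sketched from a workstation as follows. HTTP-01 challenges are served under `/.well-known/acme-challenge/<token>`; the sketch assumes `dig` and `curl` are available, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: probe DNS and the HTTP-01 challenge path for a domain.
# Assumptions: dig and curl are installed. The token can be read from the
# Challenge resource, for example:
#   kubectl get challenge <name> -n <namespace> -o jsonpath='{.spec.token}'

# challenge_url DOMAIN TOKEN
# Builds the URL the ACME server requests during HTTP-01 validation.
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# probe_challenge DOMAIN TOKEN
# Prints the resolved addresses, then the HTTP status code of the challenge URL.
probe_challenge() {
  dig +short "$1"
  curl -sS -o /dev/null -w '%{http_code}\n' "$(challenge_url "$1" "$2")"
}
```

An empty `dig` result points at DNS, while a non-200 status code suggests the request is not reaching the solver through the gateway.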

Manual certificate renewal

  1. Delete the existing secret to force re-issuance:
kubectl delete secret <tls-secret-name> -n <namespace>
  2. cert-manager will detect the missing secret and re-issue automatically.
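
Once the secret has been recreated, its new expiry date can be confirmed directly. This is a sketch that assumes `openssl` and a `base64` that accepts `-d` are installed; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: print the expiry date of the certificate stored in a TLS secret.
# Assumptions: openssl is installed; base64 supports -d (GNU coreutils).

# strip_not_after: reduce openssl's "notAfter=<date>" line to just the date.
strip_not_after() {
  sed 's/^notAfter=//'
}

# secret_expiry SECRET NAMESPACE
# openssl reads the first certificate in tls.crt, which is the leaf.
secret_expiry() {
  kubectl get secret "$1" -n "$2" -o jsonpath='{.data.tls\.crt}' |
    base64 -d | openssl x509 -noout -enddate | strip_not_after
}
```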

cert-manager pods not running

  1. Check the HelmRelease:
flux get helmrelease cert-manager -n cert-manager
  2. If the release is in a failed state, suspend and resume:
flux suspend helmrelease cert-manager -n cert-manager
flux resume helmrelease cert-manager -n cert-manager
  3. Force reconciliation:
flux reconcile helmrelease cert-manager -n cert-manager --with-source
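
After reconciliation, it helps to confirm that the pods actually came back. The sketch below waits on the two deployments named elsewhere in this runbook; the function name, the `DRY_RUN` switch, and the 120-second timeout are hypothetical, and any other cert-manager deployments in the cluster would need to be added to the list.

```shell
#!/bin/sh
# Sketch: wait for the cert-manager deployments to finish rolling out.
# Set DRY_RUN=1 to print the commands instead of running them.

wait_for_cert_manager() {
  for deploy in cert-manager cert-manager-webhook; do
    ${DRY_RUN:+echo} kubectl rollout status "deployment/$deploy" \
      -n cert-manager --timeout=120s || return 1
  done
}
```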

Prevention

Escalation